week 10: data handling
week 11: heatmaps
week 12: merging tables
week 13: data modeling basics
week 14: simple shiny apps
week 15: exam (9th of February)
In the third cycle we will use different data sets for the different
topics we want to cover.
topic: loading in and manipulating data that is not very clean or has some
special features
data needs to be downloaded from the cloud (folder week 10)
we will use two data sets:
parclip_data.csv
-is biological data that we only want to practice loading in
properly
-published in Hoell et al., RNA targets of wild-type and mutant FET
family proteins. Nat Struct Mol Biol. 2011 Nov 13;18(12):1428-31. doi:
10.1038/nsmb.2163.
GSE174302_all-reads-available-samples.txt
-is biological/bioinformatical data from cell-free RNA sequencing
-published in Chen et al., Cancer type classification using plasma
cell-free RNAs derived from human and microbes. Elife. 2022 Jul
11;11:e75181. doi: 10.7554/eLife.75181.
-includes different patients (in the columns starting from col2) and the
expression values of the different genes listed in column 1
(“feature”)
-the feature column not only contains the gene name but also further
information like the ensembl id, the transcript length,…
## load the tidyr package to use piping and different functions for data frame manipulation
library(tidyr)
The first task is to load in the parclip_data.csv
parclip_data <- read.csv(file = "C://Users/Julia/Downloads/parclip_data.csv")
parclip_data %>% View()
This is the regular way to read in the .csv-file.
But as you can see the data frame is not usable like that.
The first problem is that we need to skip eight rows because these
were used for annotations and not actual data.
parclip_data <- read.csv(file = "C://Users/Julia/Downloads/parclip_data.csv", skip = 8)
parclip_data %>% View()
The data is still a mess because the values are not separated by a
comma (the default separator) but by a semicolon.
The separator can be chosen using the sep argument, e.g. sep = ";".
parclip_data <- read.csv(file = "C://Users/Julia/Downloads/parclip_data.csv", skip = 8, sep = ";")
parclip_data %>% View()
Now we want to load the GSE174302_all-reads-available-samples.txt
file.
As it is not a .csv-file we use the read.table() function now instead of
read.csv().
seq_data <- read.table(file = "C://Users/Julia/Downloads/GSE174302_all-reads-available-samples.txt")
seq_data %>% View()
The data is loaded properly, but the first row is not used as the
header.
This can be fixed with the header argument by setting it to TRUE (or
abbreviated to T).
seq_data <- read.table(file = "C://Users/Julia/Downloads/GSE174302_all-reads-available-samples.txt", header = T)
seq_data %>% View()
Alternatively, you can also use the read.csv() function instead if you
choose the correct separator.
# seq_data <- read.csv(file = "C://Users/Julia/Downloads/GSE174302_all-reads-available-samples.txt", sep = "\t")
# \t is the abbreviation for tab because here the values are separated by tabs instead of commas (or semicolons)
Now we want to work with the seq_data. The first problem we are
facing is that the first column (called “feature”) contains a lot of
information which should be separated into several columns. By this you
can, for example, properly filter according to these features.
We can use the separate() function for this from the package
tidyr.
For this you have to define
1. which column should be separated (column named “feature” in our
case)
2. into which columns they are supposed to be separated (as there are
multiple we have to combine them and we choose the names)
3. by which symbol the columns should be separated (in this case it’s
the symbol |)
Problem: we can not just write sep = "|" because in regular expressions
| is the "OR" (alternation) operator. Instead we have to "protect"
(escape) the symbol with two backslashes, as shown here:
seq_data %>%
separate(col = feature, into = c("ensembl_id", "biotype", "gene_name", "ensembl_id2","ensembl_id3", "unknown", "transcript_length"), sep = "\\|")
The “feature” column is now split into seven new columns with the titles we chose and the information that was separated by the | symbol. Thereby the original “feature” column disappears.
Next we want to remove the last values of the ensembl_id after the .
(e.g. convert ENSG00000000003.14 to ENSG00000000003). Here, we can use
the stringr package and regular expressions again.
# load stringr and dplyr package
library(stringr)
library(dplyr)
# separate the "feature" column into several columns and remove the last digits from the ensembl_id, assign the manipulated data set to a new variable
seq_data_clean <- seq_data %>%
separate(col = feature, into = c("ensembl_id", "biotype", "gene_name", "ensembl_id2","ensembl_id3", "unknown", "transcript_length"), sep = "\\|") %>%
mutate(ensembl_id = str_remove(string = ensembl_id, pattern = "\\..*"))
For the pattern we want to address the dot and whatever comes after it. We can not simply write a dot in the pattern because an unescaped dot is the "wildcard symbol" and matches any single symbol. To target the actual dot we have to escape it with two backslashes ("\\."). The ".*" that follows consists of the wildcard dot ("any symbol") and the asterisk ("repeated zero or more times"), so it matches everything behind the literal dot. The whole pattern "\\..*" therefore removes the dot together with the version suffix. (Using "+" instead of "*" would additionally require at least one symbol after the dot. Note the two roles of the dot here: escaped it is a literal dot, unescaped it is the wildcard.)
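The two roles of the dot can be checked on a small example (a sketch using str_remove() from stringr; the example id is taken from above):

```r
library(stringr)

id <- "ENSG00000000003.14"

# "\\..*" = escaped (literal) dot followed by anything -> removes the version suffix
str_remove(string = id, pattern = "\\..*")  # "ENSG00000000003"

# an unescaped "." alone is the wildcard and matches ANY single symbol,
# so it removes the first character instead of the dot
str_remove(string = id, pattern = ".")      # "NSG00000000003.14"
```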
Next we want to add a new column to the table called “length_category”. In this we categorize the transcript length values as “short” (up to 2000) or “long” (>2000) using the if_else() function.
# for this we create a new data set called "test_data" that includes only the first 5 rows of the seq_data_clean (so everything runs faster)
test_data <- seq_data_clean %>%
head(5)
# we add a new column with the length categorization according to the transcript_length and assign it to a new variable
# note: after separate() the transcript_length column is of type character, so it has to be converted with as.numeric() before the numeric comparison
new_test_data <- test_data %>% mutate(length_category = if_else(condition = as.numeric(transcript_length) > 2000, true = "long", false = "short"))
Now we save the newly generated table as a .csv-file using the write.csv() function. You only have to define the data frame you want to save (x) and the path where it is supposed to be saved, including the file name and the .csv ending (file).
write.csv(x = new_test_data, file = "C://Users/Julia/Desktop/new_test_data.csv")
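One thing to be aware of: write.csv() by default also writes the row names as an unnamed first column; when such a file is read back in with read.csv(), they show up as an extra column called "X". A small self-contained sketch (the data frame and the tempfile() path are made up, standing in for new_test_data and the Windows path above):

```r
# small example data frame
example_df <- data.frame(id = c(1, 2), value = c("a", "b"))

# tempfile() stands in for a path like "C://Users/Julia/Desktop/..."
f <- tempfile(fileext = ".csv")

# row.names = FALSE prevents the extra unnamed row-name column in the .csv-file
write.csv(x = example_df, file = f, row.names = FALSE)

# reading it back gives exactly the original columns, no extra "X" column
read.csv(f)
```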
Heatmaps are ideal for showing three variables in one plot, e.g. how
many calls a company gets per weekday and per hour of the day (two
dimensions, day and time, with the number of calls indicated by the
color gradient).
Heatmaps often contain dendrograms. Dendrograms are formed by
hierarchical clustering and show the relationships between objects as
branches. In heatmaps, dendrograms show how similar the samples are,
and the coloring indicates which attributes of the data are responsible
for the clustering.
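The clustering behind a dendrogram can also be done directly in base R: dist() computes the pairwise distances and hclust() builds the tree. A minimal sketch with made-up values (pheatmap performs equivalent steps internally when clustering is enabled):

```r
# three "samples" with two made-up measurements each
m <- rbind(sample_A = c(1.0, 1.1),
           sample_B = c(1.2, 0.9),
           sample_C = c(5.0, 5.2))

d <- dist(m)     # Euclidean distances between the rows
hc <- hclust(d)  # hierarchical clustering
plot(hc)         # the dendrogram: sample_A and sample_B branch together first
```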
There are a lot of different ways to create heatmaps in R, e.g. a
base R function and several packages. Our personal recommendation is
the pheatmap package.
Most important is the shape of the input data frame: the data
frame (or matrix) needs to be in the wide format and all
data must be numeric. Rows and columns that label the
samples need to be converted to row names and
column names. Additional rows or columns that shouldn't
be plotted need to be deleted. For this data wrangling, packages
like dplyr and tibble can be used. There is a whole library collection
called "tidyverse" that includes useful packages such as dplyr,
tibble, tidyr, ggplot2 and others.
-if you want (and are able to) download tidyverse you can activate
all of them at once
-if the tidyverse installation failed, we can also install/load the
single libraries dplyr, ggplot2, tibble
Data source: https://opendata.dwd.de/climate_environment/CDC/regional_averages_DE/
modified table with mean temperatures in Germany per year + per
month
the months of the data are indexed by numbers. E.g. 1 means
January, 2 means February etc
for creating the heatmap, we need to convert the data into the
wide format (columns are months, rows are the years)
using pivot_wider()
the column_to_rownames() function from the tibble library
converts a specified column into the row names
colnames() function can be used to rename the column
names
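Before applying this to the real data, the reshaping steps can be illustrated on a tiny made-up table (the years, months and temperatures are invented for the example):

```r
library(tidyr)
library(tibble)

# long format: one row per year/month combination (made-up temperatures)
temp_long <- data.frame(year = c(2021, 2021, 2022, 2022),
                        month = c(1, 2, 1, 2),
                        mean_temp = c(0.5, 1.2, 1.0, 2.3))

# wide format: the months become columns, one row per year
temp_wide <- pivot_wider(temp_long, names_from = month, values_from = mean_temp)

# move the year column into the row names and rename the month columns
temp_wide <- column_to_rownames(temp_wide, "year")
colnames(temp_wide) <- c("January", "February")
temp_wide  # 2 rows (2021, 2022) x 2 columns (January, February), all numeric
```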
## install (if necessary) and load the single tidyverse packages (or load the whole tidyverse collection instead if you have it)
library(tibble)
library(ggplot2)
library(dplyr)
library(tidyr)
## install and load the pheatmap library
library(pheatmap)
## load data
meantemp_germany <- read.csv(file = "C:/Users/Julia/Desktop/Promotion/teaching/dataanalysis_in_R/heatmap_lecture/meantemp_germany.csv", sep = ",")
## modify data (wide format, all data needs to be numeric)
meantemp_germany_clean <- meantemp_germany %>%
select(year, month, mean_temp_ger) %>%
filter(year >= 1972) %>%
pivot_wider(names_from = month, values_from = mean_temp_ger)
meantemp_germany_clean <- column_to_rownames(meantemp_germany_clean, 'year')
colnames(meantemp_germany_clean) <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
#default pheatmap function
pheatmap(meantemp_germany_clean)
#turn off default clustering
pheatmap(meantemp_germany_clean, cluster_rows = FALSE, cluster_cols = FALSE)
#change border colors
pheatmap(meantemp_germany_clean, cluster_rows = FALSE, cluster_cols = FALSE, border_color = "white")
pheatmap(meantemp_germany_clean, cluster_rows = FALSE, cluster_cols = FALSE, border_color = "black")
pheatmap(meantemp_germany_clean, cluster_rows = FALSE, cluster_cols = FALSE, border_color = NA) # NA removes the cell borders completely
#add gaps between columns and rows
pheatmap(meantemp_germany_clean, cluster_rows = FALSE, cluster_cols = FALSE, border_color = NA, gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49))
#add a title to the plot
pheatmap(meantemp_germany_clean, cluster_rows = FALSE, cluster_cols = FALSE, border_color = NA, gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), main = "Mean temperatures in Germany 1972 - 2022")
##add annotation
#create annotation data frame
season_df <- data.frame("season" = c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer", "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter"))
#change row names
row.names(season_df) <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = NA, gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df)
# change the label angle of the columns using the angle_col argument (270, 0, 45, 90 and 315 degrees are possible)
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, angle_col = 45)
# display the values within the cells using the display_numbers argument
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, angle_col = 45, display_numbers = T)
# assign heatmap to a variable
temp_heatmap <- pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, angle_col = 45, display_numbers = T)
# save the plot as e.g. png (jpg, svg, pdf etc is also possible, just change the file name extension)
ggsave(
"~/Desktop/heatmap.png", #path where it is supposed to be saved + file name + file name extension of chosen format
plot = temp_heatmap, #plot you want to save
width = 15, #optional
height = 20, #optional
units = c("cm") #optional
)
With these settings the annotation labels and the column labels are a
bit cropped. To solve this problem I changed the cellwidth
argument.
# assign heatmap to a variable
temp_heatmap <- pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, angle_col = 45, display_numbers = T, cellwidth = 20)
# save the plot as e.g. png (jpg, svg, pdf etc is also possible, just change the file name extension)
ggsave(
"~/Desktop/heatmap.png", #path where it is supposed to be saved
plot = temp_heatmap, #plot you want to save
width = 15, #optional
height = 20, #optional
units = c("cm") #optional
)
## change annotation colors
#create color list
season_color = list(season = c("Winter" = "blue", "Spring" = "green", "Summer" = "red", "Autumn" = "yellow"))
#add annotation colors argument and the color list
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, annotation_colors = season_color)
#here is an overview of all the color names you can use implemented in R
#http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
season_color_more = list(season = c("Winter" = "skyblue", "Spring" = "darkolivegreen3", "Summer" = "sienna1", "Autumn" = "lightgoldenrod1"))
#add annotation colors argument and the color list
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, annotation_colors = season_color_more)
#there are also other color formats you can pick colors from, e.g. the hex format so you have more options/shades
season_color_hex = list(season = c("Winter" = "#4477AA", "Spring" = "#228833", "Summer" = "#EE6677", "Autumn" = "#CCBB44"))
#add annotation colors argument and the color list
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, annotation_colors = season_color_hex)
############################################
#skip row names
#if you want to label only every second row e.g. you can do so by entering the row labels in the labels_row argument
#all the rows you want to skip have to be indicated by empty characters ""
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, border_color = "white", gaps_col = c(3, 6, 9), gaps_row = c(9, 19, 29, 39, 49), annotation_col = season_df, angle_col = 45, display_numbers = T, labels_row = c("1972", "", "1974", "", "1976", "", "1978", "", "1980", "", "1982", "", "1984", "", "1986", "", "1988", "", "1990", "", "1992", "", "1994", "", "1996", "", "1998", "", "2000", "", "2002", "", "2004", "", "2006", "", "2008", "", "2010", "", "2012", "", "2014", "", "2016", "", "2018", "", "2020", "", "2022"))
############################################
#change color of the gradient
#if you want to change the colors of the gradient there are a lot of resources you can pick different colors or (diverging) color schemes from
#here I will provide a small overview of functions you can use but there are way more options
#define the color gradient always using the color argument
### hcl_palettes
## using the link below you can find a collection of hcl color palettes; pick one by calling hcl.colors() and entering its name
## the number indicates how many colors the gradient is built from
## https://colorspace.r-forge.r-project.org/articles/hcl_palettes.html
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, color = hcl.colors(100, "Blue-Red"))
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, color = hcl.colors(100, "Green-Brown"))
### colorRampPalette
##you can pick colors and create your own color scheme using colorRampPalette
##in this case you should define the colors manually and save the function (e.g. "my_colors")
##then call the function in the color argument
my_colors <- colorRampPalette(c("skyblue3", "white", "violetred3")) # Manual color range
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, color = my_colors(100), border_color = "black")
### using the color brewer package
#https://www.geeksforgeeks.org/introduction-to-color-palettes-in-r-with-rcolorbrewer/
#install.packages("RColorBrewer")
library("RColorBrewer")
##show (diverging) color schemes in the color brewer package, you can define to only get color blind friendly schemes
display.brewer.all(type="div", colorblindFriendly = T)
##choose a scheme and assign it to a color function (n indicates the number of colors taken from the brewer palette)
my_colors2 <- colorRampPalette(brewer.pal(n=9,name="PiYG"))(255)
pheatmap(meantemp_germany_clean, cluster_cols = F, cluster_rows = F, color = my_colors2, border_color = NA)
For the next example we use a small subset of the sequencing data we
already worked with in week 10.
# load data
seq_data_clean <- read.csv(file = "C:/Users/Julia/Desktop/Promotion/teaching/dataanalysis_in_R/heatmap_lecture/seq_data_clean.csv", sep = ",")
# get rid of unnecessary X column (you can also use select instead as we did above)
seq_data_clean$X <- NULL
# use gene names column as row names
seq_data_clean <- column_to_rownames(seq_data_clean, "gene_name")
# generate default heatmap
pheatmap(seq_data_clean)
The default heatmap shows a lot of low values and a few very high
ones, and the color gradient is set according to these values. So it
is hard to see differences among the lowly expressed genes
(almost the whole heatmap appears blue). In this case
normalization is necessary. There is a normalization
function implemented in pheatmap (the scale argument), and you can
decide whether to normalize the rows or the columns. In our case we
have to normalize the rows (the different expression values per gene).
There are a lot more normalization functions calculating different
things (e.g. in base R). If you want to use one of them for your data,
please check the documentation for what is actually calculated. Here,
the mean and standard deviation are calculated per row (= gene), and
then (value - mean) / standard deviation is computed. Zero means the
value equals the mean, positive values are higher than the mean, and
negative values are lower than the mean.
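What scale = "row" computes can be reproduced with base R's scale() function, which is a good way to check what pheatmap actually plots. A sketch with a made-up expression matrix:

```r
# made-up expression matrix: two genes (rows), three samples (columns)
m <- rbind(geneA = c(1, 2, 3),
           geneB = c(10, 20, 60))

# scale() centers and scales column-wise, so transpose, scale, transpose back
m_scaled <- t(scale(t(m)))

# every row now has mean 0 and standard deviation 1:
# 0 = value equals the row mean, positive = above it, negative = below it
round(rowMeans(m_scaled), 10)  # 0 for both genes
```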
# normalize the data per row
pheatmap(seq_data_clean, scale = "row")
First, we want to create example data frames. Later on we will merge
them with each other.
For creating the example data frames we use the data.frame()
function (base R). Within the function you specify each column name
and its values (combined with c()). If you want to create several
columns, they need to be separated by commas.
library(dplyr)
#create example data frame 1 with one column called "id" and the values 1 and 2 and a second column "fruits" with the values apple and banana
df1 <- data.frame(id = c(1, 2),
fruits = c("apple", "banana"))
#second data frame with the same pattern
df2 <- data.frame(id = c(2, 3),
vegetables = c("tomato", "potato"))
In the package dplyr, there are a lot of different functions for
merging data frames called inner_join,
full_join, left_join,
right_join and anti_join with
different settings.
Within the functions you always have to specify the data frames to be
merged and the key by which they are supposed to be merged (= the column
they have in common).
The inner_join function creates a new data frame
including only the observations that occur in both data
frames (according to the key). This is a good option if you want to
work only with the observations present in both data frames, but be
aware that you can lose a lot of information.
#merge only the observations found in both data frames (according to the key)
inner_join(df1, df2, by = "id")
## id fruits vegetables
## 1 2 banana tomato
If you want to keep all the data, even if a key
is not found in both data frames, you need the function
full_join. This is a safe option for not losing any data,
but you can get a lot of NA values.
#merge all observations
full_join(df1, df2, by = "id")
## id fruits vegetables
## 1 1 apple <NA>
## 2 2 banana tomato
## 3 3 <NA> potato
Using left_join you keep everything in the "left"
data frame (= the one you put first, in this case df1), while from
the "right" data frame (= df2) only the values with matching keys are
kept.
So in this case the order of the data frames is important.
If you want to keep all vegetables and only want to add the fruits with
matching keys you have two options:
1. put df2 as “left” (change the order) or 2. use
right_join
#merge the data frames based on df1's keys
left_join(df1, df2, by = "id")
## id fruits vegetables
## 1 1 apple <NA>
## 2 2 banana tomato
#merge the data frames based on df2's keys
left_join(df2, df1, by = "id")
## id vegetables fruits
## 1 2 tomato banana
## 2 3 potato <NA>
right_join(df1, df2, by = "id")
## id fruits vegetables
## 1 2 banana tomato
## 2 3 <NA> potato
#create a new df2 with capitalized ID instead of id
#now the keys are not the same
df2 <- data.frame(ID = c(2,3),
vegetables = c("tomato", "potato"))
#merge the data frames by defining keys for both data frames ("id" refers to df1, "ID" to df2)
inner_join(df1, df2, by = c("id" = "ID"))
## id fruits vegetables
## 1 2 banana tomato
#create df1 with a duplicate key (2 is included twice)
df1 <- data.frame(id = c(1, 2, 2, 3),
fruits = c("apple", "banana", "banana", "orange"))
#create df2 with single keys
df2 <- data.frame(id = c(2, 3),
vegetables = c("tomato", "potato"))
#merge the data frames
left_join(df1, df2, by = "id")
## id fruits vegetables
## 1 1 apple <NA>
## 2 2 banana tomato
## 3 2 banana tomato
## 4 3 orange potato
Now the tomato is duplicated in the vegetables column because its
key was "called" twice by the first data frame.
This can silently distort your data: there is no warning or error
message for it.
So you have to keep this in mind and think of a way to check that your
data is still correct after merging (= a sanity check).
This could be done e.g. by using the count function before and after
merging to see whether the number of observations changes.
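Such a check could look like this (a sketch; the data frames are made up, with the duplicated key placed in the right data frame this time so the row number actually changes):

```r
library(dplyr)

df_left <- data.frame(id = c(1, 2, 3),
                      fruits = c("apple", "banana", "orange"))
# the key 2 appears twice in the right data frame
df_right <- data.frame(id = c(2, 2, 3),
                       vegetables = c("tomato", "cucumber", "potato"))

merged <- left_join(df_left, df_right, by = "id")

# a left join should keep the row number of the left table --
# if it grew, some key on the right side was matched more than once
nrow(df_left)  # 3
nrow(merged)   # 4 -> the duplicated key produced an extra row
```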
#creating two data frames with one key they have in common (germany) and another key that is slightly different (denmark vs. denmark.)
df1 <- data.frame(id = c("germany", "denmark"),
fruits = c("apple", "banana"))
df2 <- data.frame(id = c("germany", "denmark."),
vegetables = c("tomato", "potato"))
#merge the data frames
anti_join(df1, df2, by = "id")
## id fruits
## 1 denmark banana
anti_join now returns "denmark" because it is part of df1 but not of df2 (df2 only contains "denmark.").
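This makes anti_join useful as a sanity check before merging: run it in both directions to list all keys that an inner_join would drop. A sketch reusing the data frames from above:

```r
library(dplyr)

df1 <- data.frame(id = c("germany", "denmark"),
                  fruits = c("apple", "banana"))
df2 <- data.frame(id = c("germany", "denmark."),
                  vegetables = c("tomato", "potato"))

anti_join(df1, df2, by = "id")  # keys only in df1: "denmark"
anti_join(df2, df1, by = "id")  # keys only in df2: "denmark."
```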